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Abstract 

Background: RNA polymerase II (Polll) is essential in gene transcription and ChlP-seq experiments have been used 
to study Polll binding patterns over the entire genome. However, since Polll enriched regions in the genome can 
be very long, existing peak finding algorithms for ChlP-seq data are not adequate for identifying such long regions. 

Methods: Here we propose an enriched region detection method for ChlP-seq data to identify long enriched 
regions by combining a signal denoising algorithm with a false discovery rate (FDR) approach. The binned ChlP- 
seq data for Polll are first processed using a non-local means (NL-means) algorithm for purposes of denoising. 
Then, a FDR approach is developed to determine the threshold for marking enriched regions in the binned 
histogram. 

Results: We first test our method using a public Polll ChlP-seq dataset and compare our results with published 
results obtained using the published algorithm HPeak Our results show a high consistency with the published 
results (80-100%). Then, we apply our proposed method on Polll ChlP-seq data generated in our own study on the 
effects of hormone on the breast cancer cell line 1\/1CF7. The results demonstrate that our method can effectively 
identify long enriched regions in ChlP-seq datasets. Specifically, pertaining to )\/lCF7 control samples we identified 
5,91 1 segments with length of at least 4 Kbp (maximum 233,000 bp); and in MCF7 treated with E2 samples, we 
identified 6,200 such segments (maximum 325,000 bp). 

Conclusions: We demonstrated the effectiveness of this method in studying binding patterns of Polll in cancer 
cells which enables further deep analysis in transcription regulation and epigenetics. Our method complements 
existing peak detection algorithms for ChlP-seq experiments. 



Background 

Chromatin immunoprecipitation combined with next 
generation sequencing technology (ChlP-seq) has been 
swiftly adopted as a standard technique for studying 
genome wide protein-DNA interaction patterns during 
the past four years. It is applied in gene regulation stu- 
dies for identify^ing transcription factor targets and bind- 
ing motifs, as well as in epigenetics research towards the 
characterization of chromatin states using various 
histone marks and RNA polymerase II (Polll) [1-3]. 

Polll plays an essential role in gene transcription. 
During transcription, it is responsible for the synthesis 
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of nascent messenger RNA molecules (mRNA) for pro- 
tein-coding genes and microRNAs [4]. The nascent 
mRNAs then go through a series of processing steps 
including splicing to form mature mRNAs. To tran- 
scribe a gene, Polll will undergose several steps includ- 
ing recruitment, initiation, elongation, and dissociation 
[4,5]. In addition, Polll pausing and pre-mature dissocia- 
tion will cause stalling of the transcription process [4,5]. 
Thus, accurately characterization of Polll binding pat- 
terns over the entire genome is of great importance in 
studying the dynamics of transcription as well as contri- 
buting to the characterization of nascent mRNA, which 
cannot be directly inferred from gene expression micro- 
array or regular RNA-seq technologies since they focus 
on mature mRNA. However, since during transcription 
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PolII elongates along the entire gene, the PolII binding 
pattern over a gene is usually not just a single peak but 
forms elongated regions as manifest in ChlP-seq data. 
PolII enriched regions can stretch to several thousands 
of basepairs (Figure 1). Traditionally, ChlP-seq data ana- 
lysis methods rely on peak region detection algorithm to 
delineate genomic regions with enriched protein bind- 
ings. However, the binding pattern of PolII poses a very 
different paradigm of computing and in turn significant 
challenges. Several peak detection algorithms were 
developed for delineating transcription factor binding 
sites and the anticipated regions are short (e.g., 200- 
1500 bp) [6-12] thus rendering such algorithms inade- 
quate for studying proteins with prevalent binding over 
the entire genome such as PolII. 

While ChlP-seq data can be considered a 1-D signal 
over the entire genome, only a few studies explicitly take 
advantage of signal denoising and detection methods 
developed in the engineering community. For instance, 
in [13], wavelet denoising technique was applied to filter 
the ChlP-seq data to identify nucleosome distribution 
patterns. For histone marks, a method called SISSR was 
recently developed [14], which takes a multiscale 
approach to analyze ChlP-seq data. This approach first 
identifies potential regions with enriched histone patterns 
and then links proximal regions which are separated by 
short intervals as a contiguous large region. The short 
intervals can be considered "noise" in the genome-wide 
signal that can be filtered out at coarser scales. 

In this paper, we also consider a ChlP-seq dataset a 
noisy 1-D signal stretched over the genome and apply 
signal processing techniques combined with statistical 



analysis for identifying large enriched regions in ChlP- 
seq data. Our approach includes three main steps. First 
we directly apply a signal denoising algorithm to process 
histogram of ChlP-seq data. Noise in ChlP-seq data ori- 
ginates from multiple sources including non-specific 
binding to the antibody, artifacts in amplification (local 
PGR), and sequencing reads mapping errors. These con- 
founding noises are usually modeled as spike noise with 
an underlying Poisson distribution. Recently it has been 
shown that the non-local means (NL-means) algorithm is 
highly effective in signal denoising compared to other 
commonly used algorithms such as median filter, low- 
pass filtering and wavelet denoising [15,16]. It should be 
noted that our proposed method is based on the 
NL-means denoising algorithm. Since the NL-means 
algorithm is optimal for Gaussian noise, we apply the 
Anscombe transformation to the binned GhlP-seq [17]. 
Consequently, the underlying noise distribution model is 
now approximated by a Gaussian distribution. Subse- 
quently, we apply the NL-means algorithm followed by 
inverse Anscombe transform. The denoised data is then 
compared against random model to determine the 
threshold for region selection based on a false discovery 
rate (FDR) approach. Finally, regions are selected based 
on the threshold, the region length, and the ratio between 
the peak value and the threshold. 

To evaluate our method, we first test our method using 
a public PolII GhlP-seq dataset and compare the results 
with published results obtained using the published algo- 
rithm HPeak [18]. Then, we apply our proposed method 
on PolII GhlP-seq data generated in our own study on 
the effects of hormone on breast cancer cell line MGF7. 
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Figure 1 Examples of PolII ChlP-seq data for MCF7 cell line. ChlP-seq data for PolII binding pattern on SEMA3C in MCF7 cell control 

samples. The top lane shows the histogram of the PolII binding densities over a range of genome. The gene covered by this range is shown in 

the bottom lane. In the bottom lane, the thick bars below the gene symbol indicate exons of the gene while the blue arrow indicates its 

orientation. The tail and head of the arrow correspond to the transcription starting site (TSS) and transcription ending site (TES) of the gene 

respectively. The same arrangements are also applied to the other figures. It is apparent that PolII not only binds to the TSS regions of the gene 

but also form long enriched regions over the entire transcript. 
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We compare the long segments obtained when tested 
against both MCF7 cell line and the MCF7 cells treated 
with 17-estradial (E2) hormone. The results demonstrate 
that our method can effectively identify both long 
enriched regions in ChlP-seq datasets and complement 
existing peak finding algorithms for a variety of potential 
wide applications such as histone mark binding pattern 
study. 

Methods 

Data selection and pre-processing 

To test our method, we first downloaded a ChlP-seq 
dataset for PolII on a prostate cancer cell line (LnCap) 
after treatment with an androgen receptor agonist 
R1881 (GSM353618) from NCBI Gene Expression 
Omnibus (GEO). The dataset was generated using lUu- 
mina Genome Analyzer II (GAII) sequencer and was 
analyzed using the HPeak algorithm [3]. In addition, we 
apply our algorithm in PolII ChlP-seq data generated in 
our research on breast cancer using MCF7 cell line. The 
accession numbers are GSM529981 (MCF7 control), 
and GSM529982 (for MCF7 treated with 17-estradial 
(E2)). For all the datasets, we downloaded the sequence 
mapping results (using the Eland algorithm provided by 
lUumina Inc.) and then generated histogram with 
selected bin size. 

Introduction of NL-means algorithm and parameter 
selection 

The NL-means algorithm was originally developed for 
image denoising [15] and was also applied to 1-D signals 
(including acoustic signals) [19] and video [20]. The 
details of the original NL-means algorithm are given in 
[15]. Here we briefly describe the essential formulation 
and the salient parameters. Basically, for each data 
point, its value is replaced by a weighted average of data 
points over the entire signal such that points with simi- 
lar neighbourhoods are given higher weights. It has 
been shown that the NL-means algorithm is optimal for 
Gaussian noise. In addition, using weighted averaging 
for points with similar neighbourhoods is suitable for 
ChlP-seq experiments since the pattern of protein bind- 
ing are considered similar across many regions of the 
genome. 

Formally, given a signal with data points X={xi,i = 1,..., 
N}, the filtered value at Xi is defined as 

N 

NL[Xi] = i^'j) ^ where is a difference measure 
between Xi and a neighbouring point Xj under the con- 

N 

straints w(/,;)>0 and = ^ Specifically, 

^ii A = ^^P(-||A/'(xO-Ar(.,)||-/(2a^)) ^ ^^^^^ 



normahzmg factor, Z(i) = J^expi-- — ^^^^> 

i 2^ 
and A/" (Xi) denote a fixed-size neighbourhood centred 
at the position /. In practice, searching for similar neigh- 
bourhood patterns over the entire genome is not feasi- 
ble. Instead, a parameter specifying the range of search 
is needed. In summary, the NL-means algorithm 
requires three parameters, the size of neighbourhood 
the range of search and the weight parameter We evalu- 
ated a series of parameter combinations (Figure 2) and 
in this study we use R = 10{bins) L = 15, and (7 = 10 

Anscombe transformation 

Since NL-means algorithm is optimal for Gaussian noise 
while the noise in ChlP-seq data is usually modelled 
using Poisson distribution, we apply Anscombe trans- 
form to the data first. Anscombe transform is a com- 
monly used variance-stabilizing transformation that 
transforms a Poisson distribution random variable into 
one with that is close to a Gaussian distribution [17,21]. 
The Anscombe transformation on data point is defined 
as follows: y = 2y/x + 3/8 and the inverse Anscombe 
transform is given by x={Y/2f'3/8, 

FDR based approach 

To select a threshold for region selection, we designed a 
FDR approach based on B random simulations. Given 
the histogram of the ChlP-seq data with M bins and the 
reads over each bin as (/ = 7, 2,..., M), we take the fol- 
lowing steps: 

1. In the b-th round of simulation {b = 1, 2,..., B), we 
randomly (Poisson distribution) position the same 
amount of sequencing reads for each chromosome as in 
the original data and then apply the NL-means algo- 
rithm. The reads over the i-th bin is denoted as r*^ . 

2. Then for the i-th bin in the histogram, the ratio 
that the observed data is less than simulated data is 

B 

recorded as = XI U^i < r.*^). 1(A) takes value 1 (resp. 

b=i 

0) if A is true (resp. false). 

3. For the b-th round of simulation, we treat it as a 
"true" signal and compare it with rest of simulated data. 

B 

For the /-th bin, we compute p*^ = I (r*^ < r*^;)Then 

h'=i 

for a threshold Pcut> the number of false peaks in this 

M 

round of simulation is d}) = J2 ^(P/^ < Pcut) 

1=1 

4. The false discovery rate for the cutoff p^ut is 
FDR{p,u,) = - tl . 

J^l{pi Spent) 
1=1 
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Figure 2 Denoising on Polll ChlP-seq data for MCF7 cell line using NL-means algorithm. This region is over tine gene ELAC2. Different 
parameters for tine NL-means algoritlim are tested on tine ClilP-seq data. Tine second panel uses tine parameter sets used in tine rest of tliis 
paper (/?=10(6/>7s),L=15, and G=^0) 



Since p^ut is a function of the region height, we obtain 
the FDR for selected threshold on height Then for each 
detected region based on the FDR approach, we also 
calculate the ratio between the peak value and the 
threshold. 



Results 

Comparing with HPeak for detecting Polll enriched 
regions in prostate cancer 

We first apply our method to a published Polll ChlP- 
seq data in prostate cancer model (GSM353618). We 
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compare our results with the published 15,833 PolII 
enriched region detected using HPeak [3,18]. As shown 
in Table 1, when FDR = 0.01, our method show a high 
overlap with published results with overlap ratio (com- 
paring to HPeak) of 80-100%. We then inspected many 
of regions that were detected using our method but 
were missed by HPeak. We found many of them corre- 
spond to potential transcription starting sites (TSS) for 
unknown genes or non-coding RNA genes. Figure 3A-B 
show examples of regions detected using our method 
using 0.01 for FDR and 6.5 for Peak-to-threshold ratio 
but were missed by the HPeak algorithm even though 
they imply the potential transcription activity for the 
covered genes. One of them is a snoRNA gene (U5E). 
Figure 3C shows examples of regions that were detected 
by HPeak but were missed by our method. In fact, using 
this setting, all regions missed by our method contain 
narrow peaks (covers one to two 25 bp bins). This 
observation suggests that the regions detected using our 
method can provide addition information regarding 
gene transcription and annotation. Further detailed 
annotation on these new regions is currently being car- 
ried out by integration with other information including 
CpG island distribution, DNase I hypersensitivity, and 
H3K4me2 binding (ChlP-seq). 

Detecting large regions with Polll enrichment in breast 
cancer cell line MCF7 

We then applied our method in the PolII ChlP-seq data- 
sets for the breast cancer cell line MCF7 using large bin 
size (1,000 bp). As shown in Table 2, pertaining to MCF7 
control samples we identified 5,911 segments with length 
of at least 4 K bp (ranging from 4,000 bp to 233,000 bp); 
and in MCF7 treated with E2 samples, we identified 6,200 
such segments (ranging from 4,000 bp to 325,000 bp). 
Some examples of the long regions detected in MCF7 con- 
trol samples are shown in Figure 4. 



Table 1 Comparing our method with IHPealc (bin size 
25 bp). 



FDR 


Peak/Threshold Ratio 


# of Regions 


Ratio of Overlap with 
HPeak (%) 


0.05 


5 


108393 


100 


0.05 


10 


35243 


100 


0.03 


9 


24213 


100 


0.03 


10 


20421 


93.09 


0.02 


8 
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98.11 


0.02 


10 


15612 


80.90 


0.01 


6 


22859 


100 


0.01 


6.5 


20358 


96.65 


0.01 


8 


15302 


80.01 



Discussion 

Identification of long segments for PolII binding are 
important for further investigation for understanding 
gene transcription regulation as well as potentially dis- 
covery novel transcripts and alternative promoters. For 
gene transcription, while PolII binding density at pro- 
moter around the TSSs was considered to determine 
gene transcription levels, recent studies show that the 
density of PolII binding on gene body is also critical 
[5,22]. We also observed such phenomena using the 
above identified segments. For instance, as shown in 
Figure 5, a segment of 16,000 bp has been identified 
over the transcript of the gene PLK2 on human chro- 
mosome 5. The MCF7 control sample has more sequen- 
cing reads over this region than the MCF7 sample 
treated with E2 sample (958 vs 454 reads with similar 
amount of total reads in chromosome 5 between the 
two samples). Although, the height of the "peak" at the 
TSS region in the MCF7 control sample is lower than 
that in the MCF7 E2 treated sample, the total transcrip- 
tion level (measured by Affymetrix gene expression 
array) is still higher in MCF7 control by a factor of 
3.95-fold (Student t-test p = 3.872 x 10'^). 

The above observations suggested that ChlP-seq tech- 
nology can reveal potential new insights and principles 
in biology. The method presented in this paper will con- 
tribute to such discovery efforts. Nevertheless, this 
method can be improved in several aspects. First, cur- 
rently the parameters for NL- means algorithm are fixed. 
In practice, the user may focus on enriched regions of 
certain length and this could lead to the change of these 
parameters. Therefore, a multiscale approach is pre- 
ferred. Second, an important utility of ChlP-seq data 
analysis is to compare enriched regions between differ- 
ent samples such as ChIP sample versus its input con- 
trol or control sample versus drug treated sample. 
Currently we are implementing a Fisher's exact test 
based approach to enable such comparison. Last but not 
the least, this method can be appUed not only to PolII 
ChlP-seq data but can also be used for analyzing other 
data such as histone mark ChlP-seq data for integrative 
genomic analysis. Currently we are also expanding our 
study to integrate ChlP-seq data for important histone 
marks including H3K4me2 and H3K27me3 in an epige- 
netic study. 

Conclusions 

In this paper, we propose an enriched region detection 
method for ChlP-seq data to identify long enriched 
regions by combining a signal denoising algorithm with 
a FDR approach. Our method complements existing 
peak detection algorithms for ChlP-seq experiments. 
We demonstrated the effectiveness of this method by 
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Figure 3 Comparison of HPeal^ and the proposed algorithm in detecting regions on the prostate Polll ChlP-seq data (parameters of 
25 bp for bin size, 0.01 for FDR and 6.5 for Peak-to-threshold ratio). The bin size is tine same between the two algorithms. A: A region 
detected using the proposed algorithm but not by HPeak. Top: the original histogram of the ChlP-seq data. Bottom: the NL-means filtered data 
and the detected regions (marked by the red bar). The length of the detected region (10375 bp) is marked under the red bar. A gene (NEATl, 
marked by black bar) is covered by this region. B: Another region detected using the proposed algorithm but not by HPeak. Top: the original 
histogram of the ChlP-seq data. Bottom: the NL-means filtered data and the detected regions (marked by the red bar, 2975 bp long). A snoRNA 
gene (U5E) and its pseudogene (marked by black bars) are covered by this region. C: Two regions detected by HPeak but not the proposed 
algorithm. The HPeak detected regions are marked using green bars (the lengths are 675 bp and 200 bp, respectively). We examined all regions 
that were missed by the proposed algorithm under this set of parameters, all regions contain short spike-like peaks as shown here. 



Han et al. BMC Bioinformatics 2012, 13(Suppl 2):S2 
http://www.biomedcentral.eom/1 471-21 05/1 3/S2/S2 



Page 7 of 9 



Table 2 Long enriched regions identified in Poll! ChlP-seq data in MCF7 cells using bin size of 1,000 bp. 



Sample 


# of regions with length 


> 4,000 bp 


# of regions with length > 


10,000 bp 


Maximum length (bp) 


MCF7 control 




5,911 




1,992 






233,000 


MCF7 + E2 




6,200 




2,310 






325,000 
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Figure 4 Examples of long regions of Poll! binding detected using our algorithm. Red bars indicate the detected regions. The segments 
are shown above the patterns over corresponding genes. The lengths of the segments are listed in the figure. The four genes (from top to 
bottom) are SEMA3C, TRIM37, SETD5, and BMP7. 
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Figure 5 The Polll binding patterns and expression levels for PLK2 gene. Top: the Polll binding patterns for PLK2 gene in control (first lane) 
and E2 treated samples (second lane). Polll shows a higher peak for the E2 treated sample but lower total amount of binding over the transcript. 
Red bar indicates the 16 Kbp region detected using our method in MCF7 and blue bar indicates the 14 Kbp region detected in MCF7+E2. 
Bottom: the expression levels of PLK2 gene in the two different conditions (n = 4). Error bar is for standard deviation. 
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studying the binding patterns of PolII in cancer cell 
lines which enables further deep analysis with applica- 
tions in transcription regulation and epigenetics. 
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